On Boosting and Noisy Labels
Abstract
Boosting is a machine learning technique widely used across many disciplines. It enables one to learn from labeled data in order to predict the labels of unlabeled data. A central property of boosting, instrumental to its popularity, is its resistance to overfitting. Previous experiments provide a margin-based explanation for this resistance. The main finding of this thesis is that boosting’s resistance to overfitting can be understood in terms of how it handles noisy (mislabeled) points. Confirming evidence emerged from experiments using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which is commonly used in machine learning experiments. A majority vote ensemble filter identified, on average, 2.5% of the points in the dataset as noisy. The experiments chiefly investigated boosting’s treatment of noisy points from a volume-based perspective. While the cell volume surrounding noisy points did not differ significantly from that of other points, the decision volume surrounding noisy points was two to three times smaller than that of non-noisy points. Additional findings showed that decision volume not only provides insight into boosting’s resistance to overfitting in the context of noisy points, but also serves as a suitable metric for identifying which points in a dataset are likely to be mislabeled.
Thesis Supervisor: Patrick H. Winston
Title: Professor
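The majority vote ensemble filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which base classifiers or which cross-validation scheme the thesis used, so everything below is an assumption made for illustration: the three simple classifiers (nearest centroid, 1-nearest-neighbor, 3-nearest-neighbor), the leave-one-out scheme, the function names, and the small hypothetical 2-D dataset (not WDBC). A point is flagged as noisy when more than half of the classifiers, each trained without that point, disagree with its given label.

```python
from statistics import mean

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(train, x):
    """Predict the label whose class centroid is closest to x."""
    centroids = {}
    for label in {lab for _, lab in train}:
        members = [feat for feat, lab in train if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return min(centroids, key=lambda lab: dist2(centroids[lab], x))

def k_nn(k):
    """Build a k-nearest-neighbor classifier (simple vote among neighbors)."""
    def predict(train, x):
        nearest = sorted(train, key=lambda p: dist2(p[0], x))[:k]
        labels = [lab for _, lab in nearest]
        return max(set(labels), key=labels.count)
    return predict

# Assumed ensemble; the thesis may have used different base learners.
CLASSIFIERS = [nearest_centroid, k_nn(1), k_nn(3)]

def majority_vote_filter(points, classifiers=CLASSIFIERS):
    """Return indices of points that a majority of classifiers mislabel.

    Uses leave-one-out: each classifier is trained on all other points
    and asked to predict the held-out point's label.
    """
    flagged = []
    for i, (features, label) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        wrong = sum(clf(rest, features) != label for clf in classifiers)
        if wrong * 2 > len(classifiers):
            flagged.append(i)
    return flagged

# Hypothetical toy data: two well-separated clusters plus one point
# (index 10) that sits in the class-0 cluster but carries label 1.
points = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
    ((1.0, 1.0), 0), ((0.5, 0.2), 0),
    ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1),
    ((6.0, 6.0), 1), ((5.5, 5.2), 1),
    ((0.5, 0.5), 1),  # deliberately mislabeled
]

print(majority_vote_filter(points))  # prints [10]
```

In this toy run the 1-nearest-neighbor classifier alone is misled by the mislabeled point when predicting some of its clean neighbors, but those neighbors are not flagged because the other two classifiers outvote it. This is the reason a majority vote is more conservative than flagging on any single classifier's disagreement.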
Similar resources
An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition
Convolutional Neural Networks (CNNs) have demonstrated their performance in speech recognition systems, both for feature extraction and for acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. The Convolutive Bottleneck Network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck fea...
Anomaly Detection In Labeled Data
Noisy points in training data may be due to incorrect class labels or erroneous recording of attribute values. These points greatly influence the orientation of the classification boundary. In this paper, we formalize two notions of noisy points: intrusive outliers and hard-to-classify points. We adapt two well-known distance-based notions of outliers in unlabeled data to formalize intrusive out...
Discriminative Training of Acoustic Models for System Combination
In discriminative training methods, the objective function is designed to improve the performance of automatic speech recognition with reference to correct labels using a single system. On the other hand, system combination methods, which output refined hypotheses by a majority voting scheme, need to build multiple systems that generate complementary hypotheses. This paper aims to unify the bot...
Ensemble Neural Relation Extraction with Adaptive Boosting
Relation extraction has been widely studied as a way to extract new relational facts from open corpora. Previous relation extraction methods face the problem of wrong labels and noisy data, which substantially decrease the performance of the model. In this paper, we propose an ensemble neural network model, Adaptive Boosting LSTMs with Attention, to more effectively perform relation extraction. ...